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Abstract 

We  develop  an  efficient,  Bayesian  Uncertainty  Quantification  framework  us¬ 
ing  a  novel  treed  Gaussian  process  model.  The  tree  is  adaptively  constructed 
using  information  conveyed  by  the  observed  data  about  the  length  scales  of 
the  underlying  process.  On  each  leaf  of  the  tree,  we  utilize  Bayesian  Experi¬ 
mental  Design  techniques  in  order  to  learn  a  multi-output  Gaussian  process. 
The  constructed  surrogate  can  provide  analytical  point  estimates,  as  well  as 
error  bars,  for  the  statistics  of  interest.  We  numerically  demonstrate  the 
effectiveness  of  the  suggested  framework  in  identifying  discontinuities,  local 
features  and  unimportant  dimensions  in  the  solution  of  stochastic  differential 
equations. 
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1.  Introduction 

Uncertainty  Quantification  (UQ)  is  a  field  of  great  importance  in  practi¬ 
cally  all  engineering  tasks.  Physical  models  require  as  input  certain  param¬ 
eters  such  as  physical  constants,  equations  of  state,  geometric  specification 
of  objects,  boundary  conditions,  initial  conditions  and  so  on.  In  general, 
exact  knowledge  of  these  quantities  is  impossible  either  due  to  measurement 
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errors  or  because  they  are  truly  random.  As  a  consequence,  both  the  input 
parameters  as  well  as  the  physical  responses  have  to  be  modeled  as  random 
variables.  The  goal  of  UQ  is  to  study  the  propagation  of  uncertainty  from 
the  parameter  space  to  the  response  space.  The  most  celebrated  method  for 
the  solution  of  UQ  problems  is  the  Monte  Carlo  (MC)  method.  MC’s  wide 
acceptance  is  due  to  the  fact  that  it  can  uncover  the  complete  statistics  of 
the  solution,  while  having  a  convergence  rate  that  is  (remarkably)  indepen¬ 
dent  of  the  input  dimension.  Nevertheless,  it  quickly  becomes  inefficient  in 
high  dimensional  and  computationally  intensive  problems,  where  only  a  few 
samples  can  be  observed.  Such  difficulties  have  been  (partially)  alleviated 
by  improved  sampling  techniques  such  as  Latin  hypercube  sampling  [1]  and 
multilevel  MC  [2,  3]. 

Another  approach  to  UQ  is  the  so  called  spectral  finite  element  method  [4], 
It  involves  the  projection  of  the  response  on  a  space  spanned  by  orthog¬ 
onal  polynomials  of  the  random  variables  and  the  solution  of  a  system  of 
coupled  deterministic  equations  involving  the  coefficients  of  these  polyno¬ 
mials.  The  scheme  was  originally  developed  for  Gaussian  random  variables 
which  correspond  to  Hermite  polynomials  (polynomial  chaos  (PC)).  It  was 
later  generalized  to  include  other  types  of  random  variables  (generalized  PC 
(gPC))  [5].  Due  to  the  global  support  of  the  polynomials  used,  gPC  suffers 
from  the  well-known  Gibbs  phenomenon  in  the  presence  of  discontinuities 
in  the  random  space.  The  multi-element  generalized  polynomial  chaos  (ME- 
gPC)  method  [6,  7]  was  introduced  in  order  to  address  exactly  this  issue. 
The  idea  of  the  multi-element  (ME)  approach  is  to  decompose  the  stochastic 
space  in  disjoint  elements  and  then  employ  gPC  on  each  element.  However, 
the  coupled  nature  of  the  equations  that  determine  the  coefficients  of  the 
polynomials  make  the  application  of  the  method  to  high  input  dimensions 
extremely  difficult  ( curse  of  dimensionality). 

Throughout  the  paper,  we  assume  that  we  have  at  hand  a  well-established 
computer  code  that  emulates  the  physical  system.  In  fact,  we  will  investi¬ 
gate  the  propagation  of  uncertainty  from  the  input  of  the  computer  code  to 
the  output,  by  learning  the  response  surface  using  well  selected  observations. 
Any  modeling  or  discretization  error  will  be  ignored  in  this  study.  The  so 
called  stochastic  collocation  methods  have  been  designed  to  deal  with  this  sit¬ 
uation.  The  response  is  represented  as  an  interpolative  polynomial  of  the  ran¬ 
dom  input  constructed  by  calls  to  the  computer  code  at  specific  input  points. 
However,  the  construction  of  the  set  of  interpolation  points  is  non-trivial,  es¬ 
pecially  in  high-dimensional  settings.  In  [8],  a  Galerkin  based  approximation 
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was  introduced  in  conjunction  with  a  collocation  scheme  based  on  a  tensor 
product  rule  using  one- dimensional  Gauss  quadrature  points.  Despite  its  ap¬ 
peal,  the  method  scales  badly  with  the  number  of  random  input  dimensions. 
Alternatively,  sparse  grids  (SG)  based  on  the  Smolyak  algorithm  [9]  have  a 
weaker  dependence  on  the  input  dimensionality.  In  [10,  11,  12],  the  Smolyak 
algorithm  is  employed  to  build  sparse  grid  interpolants  in  high- dimensional 
input  spaces  based  on  Lagrange  interpolation  polynomials.  Similarly  to  gPC, 
such  methods  also  fail  to  capture  local  features  of  the  response.  From  the 
above  discussion,  it  is  apparent  that  discontinuities  in  the  stochastic  space 
must  be  dealt  with  using  a  basis  with  local  support.  In  [13],  the  authors  de¬ 
veloped  an  adaptive  version  of  SG  collocation  (SGC)  based  on  localized  hat 
functions  called  Adaptive  SGC  (ASGC).  ASGC  is  able  to  refine  the  sparse 
grid  only  in  important  regions  of  the  stochastic  space,  e.g.  near  a  discon¬ 
tinuity.  Nevertheless,  the  piecewise  linear  nature  of  the  scheme  performs 
poorly  when  only  a  few  samples  are  used  while  adverse  functions  can  trick 
the  adaptive  algorithm  into  stopping  without  converging. 

Highly  sophisticated  computer  codes  modeling  real-life  phenomena  (like 
weather,  ocean  waves,  earthquakes,  etc.)  might  take  hours  or  even  days 
to  complete  a  single  run  in  massively  parallel  systems.  Therefore,  we  are 
necessarily  limited  to  observing  only  a  few  realizations.  Motivated  by  this 
situation,  we  would  like  to  consider  the  problems  of  (1)  selecting  the  most 
informative  observations  and  (2)  quantifying  the  uncertainty  in  the  predic¬ 
tion  of  the  statistics.  From  the  above  mentioned  methods,  ASGC  addresses 
only  problem  (1),  albeit  in  an  ad  hoc  manner.  In  order  to  deal  with  (1) 
and  (2)  in  a  principled,  information  theoretic  way,  a  Bayesian  framework 
is  necessary.  To  this  end,  we  choose  to  investigate  the  performance  of  the 
Gaussian  process  (GP)  model.  The  GP  model  has  been  used  in  computer 
experiments  in  the  pioneering  work  of  Sacks  [14]  (for  a  more  recent  review 
see  the  book  [15]).  GP  is  particularly  interesting,  since  it  provides  an  an¬ 
alytically  tractable  Bayesian  framework  where  prior  information  about  the 
response  surface  can  be  encoded  in  the  covariance  function,  and  the  uncer¬ 
tainty  about  the  prediction  is  easily  quantified.  It  is  exactly  this  uncertainty 
in  the  prediction  that  can  be  exploited  in  order  to  select  the  observations  to 
be  made  (see  [16]),  as  well  as  to  quantify  the  uncertainty  in  the  statistics. 
One  of  the  drawbacks  of  GP  inference  is  that  it  scales  as  the  cube  of  the 
number  of  observations,  making  the  treatment  of  large  data  sets  computa¬ 
tionally  demanding.  Furthermore,  the  most  common  covariance  functions 
used  in  practice  are  stationary.  The  effect  of  the  stationarity  assumption  is 


3 


that  it  makes  non-stationary  responses  and  localized  features  (such  as  dis¬ 
continuities)  a  priori  highly  improbable,  resulting  in  an  excessive  number 
of  samples  being  required  in  order  to  uncover  them.  A  successful  effort  to 
deal  with  these  difficulties  has  been  carried  out  in  [17].  Based  on  the  parti¬ 
tioning  ideas  of  the  Bayesian  CART  model  [18,  19],  a  treed  GP  model  was 
introduced.  By  making  the  GP  local  to  each  leaf  of  the  tree,  the  model  is 
able  to  process  many  more  samples.  Additionally,  anisotropy  is  captured  by 
considering  the  true  response  as  being  the  result  of  many  local  stationary 
(albeit  different)  models.  More  recently,  in  [20]  a  novel  tree  model  was  intro¬ 
duced  using  Sequential  Monte  Carlo  inference  as  opposed  to  MCMC  of  the 
classical  approaches.  The  latter  is  a  promising  step  towards  computationally 
tractable  fully  Bayesian  trees. 

In  this  work,  we  present  a  novel  non-intrusive  UQ  framework  based  on 
a  treed  multi-output  Gaussian  process  (GP).  It  operates  in  two  stages:  (a) 
the  construction  of  a  surrogate  model  for  the  physical  response  and  (b)  the 
interrogation  of  this  surrogate  for  the  statistics.  The  building  block  of  the  sur¬ 
rogate  is  a  Multi-output  Gaussian  Process  (MGP)  introduced  in  Section  2.1. 
Information  gathered  from  the  MGP  is  used  to  discover  important  directions 
of  the  stochastic  space  and  decompose  it  in  stochastic  dements  (i.e.  new 
leaves  of  the  tree)  (Section  2.4).  Each  stochastic  element  is,  in  turn,  sampled 
using  Sequential  Experimental  Design  (SED)  techniques  (Section  2.5)  and 
subsequently  modeled  using  a  new  MGP.  This  defines  an  iterative  procedure 
that  gradually  resolves  local  features  and  discontinuities.  The  final  result  is 
a  piecewise  surrogate  in  the  spirit  of  the  Multi-element  Method  (ME)  [6]. 
Despite  being  a  treed  GP,  our  model  differs  from  the  model  in  [17]  in  several 
aspects:  1)  the  tree  building  process  is  inspired  from  the  ME  method  rather 
than  Bayesian  CART  (non- probabilistic  tree),  2)  we  explicitly  derive  point 
estimates  of  the  missing  hyper-parameters  by  maximizing  the  marginal  likeli¬ 
hood  instead  of  averaging  (fast  predictions),  3)  we  treat  the  multiple  outputs 
of  the  response  in  a  unified  way  (faster  training).  Furthermore,  our  model  is 
built  specifically  to  deal  with  UQ  tasks,  in  that  the  input  probability  distri¬ 
bution  plays  an  important  role  in  the  tree  construction.  Finally,  the  resulting 
surrogate  can  be  used  to  obtain  semi-analytic  estimates  of  the  moments  of 
any  order  as  well  as  error  bars  (Sections  2.2  and  2.3). 
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2.  Methodology 

Let  X  C  MA  for  some  K  >  1  represent  the  stochastic  input  space,  a 
(potentially  infinite)  rectangle  of  M.K,  i.e.  X  =  x£=1[ak,bk\,  — oo  <  a k  <  bk  < 
oo.  We  will  assume  that  there  is  a  probability  density  p(x)  defined  for  all 
x  G  X  such  that: 

K 

p(x)  =  T[Pk(xk),  (1) 

k= 1 

where  Pk(%k )  is  the  probability  density  pertaining  to  the  k-th  dimension. 
That  is,  the  components  of  x  are  independent  random  variables.  This  as¬ 
sumption  is  very  common  in  UQ  settings  and  can  be  made  to  hold  by  a 
transformation  of  the  input  space.  We  now  consider  the  multi-output  func¬ 
tion  f  :  X  — y  Mm  representing  the  result  of  a  computer  code  (deterministic 
solver)  modeling  a  physical  system,  i.e.  at  a  given  input  point  x  G  X  the 
response  of  the  system  is  f(x).  We  will  write 

f(-)  =  (/i 

and  refer  to  /,.(•)  as  the  r-th  output  of  the  response  function,  r  —  1, ,  M.  In 
this  work,  we  will  identify  f(-)  as  the  true  response  of  an  underlying  physical 
system  and  we  will  ignore  any  modeling  errors.  The  input  probability  dis¬ 
tribution  induces  a  probability  distribution  on  the  output.  The  UQ  problem 
involves  the  calculation  of  the  statistics  of  the  output  y  =  f(x).  Quanti¬ 
ties  of  interest  are  the  moments  m9  =  (m\, . . .  ,  defined  for  q  >  1  and 
r  =  1, . . . ,  M  by: 

mr-=  [  /r(XMXMX>  (2) 

dx 

as  well  as  functions  of  them.  In  particular,  the  mean  m  =  (mi, . . . ,  m^): 

mr  :=  ml  =  [  /r(x)p(x)dx,  (3) 

./x 

and  the  variance  v  =  (ly, . . . ,  %): 

Vr  ■■=  [  (/r(x)  -  mr)2p(x)dx  =  17%  ~  ( 771* )2.  (4) 

dx 

The  statistics  will  be  calculated  by  interrogating  a  surrogate  of  f  :  X  — > 
Mm.  This  will  be  put  together  from  local  surrogates  defined  over  elements  of 
the  stochastic  space  X*  C  X  such  that: 

X  =  U[=1X*  and  int(X?:)  n  int(XQ  =  0,  Vi,  jel,i^  j,  (5) 
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where  int(X*)  denotes  the  interior  of  the  set  X*  under  the  usual  Euclidean 
metric  of  MA .  The  response  surface  is  correspondingly  decomposed  as: 

i 

f(x):=Efwixi(x)>  (g) 

i=  1 

where  lx;(x)  is  the  indicator  function  of  X*,  given  by: 


lx<(z) 


l  if  x  e  Xh 

0  otherwise 


and  f*(-)  is  just  the  restriction  of  f(-)  on  X*.  The  local  surrogates  will  be  iden¬ 
tified  as  Multi-Output  Gaussian  Processes  (MGP)  defined  over  the  stochastic 
element  X*.  These  MGPs  will  be  trained  by  observing  P ( • ) .  The  predictive 
mean  of  the  MGPs  will  be  used  to  derive  semi-analytic  estimates  of  all  mo¬ 
ments  m9.  An  addendum  of  the  Bayesian  treatment,  is  the  ability  to  provide 
error  bars  for  the  point  estimates  of  the  moments.  This  feature  is  absent 
from  most  current  UQ  methods. 

Our  aim  is  to  create  a  surrogate  by  making  as  few  calls  to  the  computer 
program  as  possible.  This  is  achieved  by  an  interplay  of  adaptively  decom¬ 
posing  the  domain  (Tree  Construction)  and  selecting  which  observations  to 
make  within  each  element  (Experimental  Design).  These  decisions  should 
be  biased  by  the  underlying  input  probability  density  p(x)  and  the  observed 
variability  of  the  responses. 

In  the  sections  that  follow,  we  introduce  the  constituent  parts  of  our 
framework.  Despite  the  fact  that  the  method  is  applicable  to  any  distribution 
p(x)  over  X,  all  numerical  examples  will  be  conducted  on  a  compact  X  (a/. 
and  bk  are  finite)  using  the  uniform  distribution.  This  is  mainly  clue  to  the 
fact  that  the  implementation  of  the  framework  is  considerably  easier  for  this 
case.  We  plan  to  investigate  and  report  the  dependence  of  the  results  on 
p(x)  in  a  future  work. 


2.1.  Multi-output  Gaussian  Process  Regression 

We  turn  our  focus  to  a  single  element  of  the  stochastic  space  X*  C  X  and 
discuss  the  construction  of  a  local  surrogate  model  based  on  some  already 
observed  data.  The  choice  of  the  elements  is  the  subject  of  Section  2.4  and 
how  the  observations  are  selected  is  investigated  in  Section  2.5.  All  quantities 
introduced  herein  are  local  to  the  element  X*.  However,  in  order  to  avoid 
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having  an  unnecessarily  complicated  notation,  we  do  not  explicitly  show  this 
dependence. 

We  assume  that  we  have  observed  a  hxed  number  N  >  1  of  data  points 

(7) 

where,  =  f(x^)  is  the  result  of  the  computer  program  with  input  x*A>. 
We  will  fit  these  data  to  a  Gaussian  Process  (GP)  model  [21,  22],  a  proce¬ 
dure  known  as  GP  Regression  (GPR).  Our  primary  concern  in  this  section 
is  to  extend  GPR  to  the  multi-output  case.  The  naive  approach  would  be 
to  model  its  output  dimension  independently.  However,  since  the  various 
outputs  of  the  response  function  are  highly  correlated,  this  strategy  will  in¬ 
cur  some  loss  of  information.  Furthermore,  training  a  GP  on  N  data  points 
involves  the  computation  of  the  Cholesky  decomposition  of  an  N  x  N  sym¬ 
metric  positive-definite  matrix,  an  operation  that  scales  as  0(N3).  If  the  M 
output  dimensions  were  to  be  modeled  independently,  then  the  total  train¬ 
ing  cost  would  be  0(A4N3)  making  the  method  inappropriate  for  most  UQ 
tasks.  Several  techniques  exist  that  model  the  correlation  between  outputs: 
e.g.  ‘co-kriging’  (Section  3.2.3  in  [23])  or  introducing  latent  (hidden)  out¬ 
puts  [24,  25,  26].  Unfortunately,  these  models  are  still  fairly  complicated 
and  computationally  demanding.  In  [27],  a  principal  components  analysis 
(PGA)  was  performed  on  the  output  space  and  then  the  PCA  coefficients 
of  the  simulations  were  modeled  using  independent  GPs.  This  approach  has 
been  proven  efficient  in  dealing  with  high-dimensional  output  settings,  since 
it  automatically  takes  care  of  output  correlations.  However,  it  introduces  an 
additional  error  arising  from  the  finite  truncation  of  the  PCA  decomposition 
of  the  output  held.  Furthermore,  it  is  not  clear  how  the  approach  can  be 
used  in  a  SED  setting,  in  which  simulations  arrive  one  by  one,  as  well  as  how 
it  performs  when  discontinuities  are  present  in  the  stochastic  space.  A  very 
recent,  theoretically  sound  way  of  modeling  multiple  outputs  was  developed 
in  [28].  In  this  approach,  the  multidimensional  response  is  modeled  as  a  GP 
vector  using  the  same  covariance  function  for  each  dimension.  It  accounts 
for  correlations  by  introducing  a  constant  correlation  matrix  between  the 
outputs.  However,  in  very  high-dimensional  settings  (typical  UQ  applica¬ 
tions  have  a  few  thousand  outputs),  dealing  with  the  full  correlation  matrix 
is  computationally  challenging.  Since  in  this  work  we  are  trying  to  develop 
a  method  that  will  be  able  to  deal  with  output  dimensions  that  range  from 
a  few  hundreds  to  a  few  thousands,  keeping  the  training  time  to  acceptable 
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levels  is  one  of  our  major  goals.  We  achieve  this  by  making  a  compromise: 
the  outputs  will  be  treated  as  conditionally  independent  given  the  covariance 
function.  Our  approach  is  similar  to  that  in  [28]  if  a  diagonal  correlation 
matrix  and  a  constant  mean  is  used.  The  underlying  assumption  is  that  the 
regularity  of  all  output  dimensions  is  approximately  the  same.  Since  each 
output  may  vary  in  signal  strength  (e.g.  finite  element  nodes  close  to  a  fixed 
boundary  condition  exhibit  smaller  variations  compared  to  ones  in  the  middle 
of  the  domain),  we  have  to  work  with  a  scaled  version  of  the  responses.  The 
computational  savings  of  using  a  single  covariance  function  for  all  outputs 
are  tremendous:  only  a  single  Cholesky  decomposition  is  required,  dropping 
the  training  cost  back  to  0(N3).  We  call  the  resulting  model  a  Multi-output 
Gaussian  Process  (MGP)  and  refer  to  regression  using  MGPs  as  MGPR. 

Let  us  introduce  the  observed  mean : 

1  N 

=  (8) 

n= 1 

and  the  observed  variance: 


of  the  data  V.  We  will  be  modeling  the  scaled  response  functions  gr  :  X*  — >  M, 
defined  by: 

/  N.  /r(x)  —  Robs,r  1  ,r  /i  n\ 

0r(x)  = - ,r  =  l,  ...,M.  (10) 

^obs,r 

The  scaling  is  necessary,  because  the  various  outputs  might  exhibit  different 
signal  strengths.  Obviously,  this  definition  depends  on  the  actual  observa¬ 
tions.  We  expect,  however,  that  if  N  is  big  or  if  the  stochastic  element  under 
investigation  is  small,  then  it  is  a  good  approximation  to  the  ideal  scaling, 
i.e.  zero  mean  and  unit  variance  for  all  outputs.  Assuming  that  all  outputs 
have  the  same  regularity,  we  model  each  gr  as  a  Gaussian  Process  with  zero 
mean  and  covariance  function  c(x,  x';  6): 

5k(x)|0  ~  QV  (0  ,c(x,x';0)),r=  1, . . . ,  M, 

where  6  G  0  C  M5  are  the  S  >  1,  unknown  hyper-parameters  of  the  co- 
variance  function.  That  is,  the  scaled  responses  are  treated  as  conditionally 
independent  given  the  hyper-parameters. 
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Point  Estimates  of  the  Hyper-parameters.  A  fully  Bayesian  approach  would 
proceed  by  imposing  a  prior  probability  7 r(0)  over  the  hyper-parameters  and 
then  average  (numerically)  over  them.  Instead,  we  will  employ  the  evidence 
approximation  to  Bayesian  inference  [29],  in  order  to  obtain  point-estimates 
of  the  hyper-parameters  by  maximizing  the  marginal  likelihood  of  the  data 
(Ch.  5  of  [22]).  This  necessarily  underestimates  the  prediction  uncertainty, 
but  it  is  a  trade-off  we  are  willing  to  make  in  order  to  obtain  a  computation¬ 
ally  tractable  model.  The  logarithm  of  the  marginal  likelihood  of  each  scaled 
response  gr(-),r  =  1 , ,M  is  given  by: 

logp(zr|£>,0)  =  -iz^CTV-  ^  log  |C|  -  ylog2vr, 
where  z,r  =  (z^\  . . . ,  z[N^)  is  a  scaled  version  of  the  observations  in  T>: 

(n)  _ 

4”)  =  Vr  ~/Xobs>r,  n  =  1, . . . ,  N,  (11) 

^obs,r 

C  =  ( Cij),Cij  =  c(x^,x^;0)  is  the  covariance  matrix  and  |C|  its  determi¬ 
nant.  Since  the  scaled  responses  are  conditionally  independent  given  0,  the 
logarithm  of  the  joint  marginal  likelihood  is  simply  the  sum  of  the  marginal 
likelihoods  of  each  output,  i.e. 


C{9)  :=  logp(z1,...,zM|X,0) 


M 


^logp(zr|X,  0) 


r= 1 


M 


l^A  _ _ _  M  NM 

-?J2Z>C  Z'r  log  I C  | - —  log2vr. 


r— 1 


Thus,  a  point  estimate  of  6  over  the  element  X*  is  obtained  by 


6*  =  argma  xjC(O).  (12) 

ee©  v  7  7 

The  joint  marginal  likelihood  C(0)  might  exhibit  multiple  maxima  which 
correspond  to  alternative  interpretations  of  the  data.  In  practice,  we  make  an 
educated  initial  guess  and  we  are  satisfied  with  the  (local)  maximum  obtained 
using  a  Conjugate  Gradient  method  [30].  The  specifics  of  the  optimization 
method  are  discussed  in  Appendix  A. 


9 


The  Predictive  Distribution.  Having  decided  on  a  point  estimate  for  the 
hyper-parameters  6 ,  we  are  ready  to  predict  the  scaled  response  at  any  test 
point  x  G  X*.  Scaling  back  to  the  original  responses,  we  can  easily  see  that 
the  predictive  distribution  of  /r(x)  is: 


with  mean: 

and  variance: 


/r(x) \T>,  8*  ~  A/”(/i/r(x;  0*),(rjr(x:  0*))  , 


Tfr  (X’  0*)  ^obs,r^  C  Zr  “h  /^obs,r5 


a 


/,(*;  <**)  =  (c(x.x; s‘)  -  cTc  !c) , 


(13) 

(14) 

(15) 


where  c  =  (c(x,x(1^*)r..,c(x,xlJV^*))  and  the  covariance  matrix  C  is 
evaluated  at  0*.  We  will  refer  to  cr|r(x:  0*)  as  the  predictive  variance  of  the 
response  at  x.  It  represents  our  uncertainty  about  the  prediction  at  this 
particular  test  point. 

As  mentioned  earlier,  the  predictive  mean  /ijr(x;0*)  given  by  Eq.  (14) 
will  be  used  to  provide  estimates  for  the  statistics  over  the  element  X*,  while 
the  predictive  variance  ajr  (x:  6*)  will  give  error  bars  (see  Section  2.2).  Notice 
that  Hfr{x.]6*)  is,  in  fact,  a  kernel  estimator  since: 


N 

Pfr (x;  6*)  =  y^qrnc(x(ri),x;  6*)  +  p.obs,r,  (16) 

n= 1 


where  the  weights  arn  are  given  by: 


OLr  —  (cKri,  Otr 2,  •  •  •  ,  CKrj\r)  •  (^"obs,rC'  Zr 


and  also  depend  on  6*  through  C. 


2.2.  Calculation  of  the  local  statistics 

As  in  the  previous  section,  we  focus  on  a  specific  element  X*.  All  quan¬ 
tities  are  again  local  to  X*.  In  order  to  keep  notational  complexity  to  a 
minimum,  we  do  not  explicitly  show  this  dependence.  We  will  derive  an¬ 
alytic  point  estimates  as  well  as  error  bars  for  the  mean  and  the  higher 
moments  of  the  response  based  on  the  linear  point  estimator  of  f(-)  over  X* 
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given  in  Eq.  (16)  and  the  predictive  variance  Eq.  (15).  To  be  exact,  we  are 
interested  in  estimating  all  moments  m9  =  (m), . . . ,  mqM ),  q  >  1,  where 


mq  =  /  fr  (x)p'(x)dx. 


pl  :  X  — )■  R  is  the  conditional  probability  density  related  to  XL 

P(x) 


p  pcj  := 


P(Xl 


■lvi  X 


(17) 


(18) 


where  P(X*)  is  the  probability  of  an  input  point  residing  in  the  stochastic 
element  X*,  i.e. 

P(Xl)  =  f  p(x)dx. 

./x! 

In  order  to  achieve  analytic  estimates  of  m9,  we  keep  concurrent  MGP 
estimates  of  the  response  raised  to  the  q  power.  In  particular,  the  q  power 
of  the  response  is  treated  also  as  a  MGP  with  its  own  hyper-parameters  6q. 
Let  us  denote  the  predictive  distribution  for  the  q  power  of  the  response  at 
xGX'  by: 

/r  (x)  \Vi  eq  ~  M  (jiffix;  eq),  a2jq  (x;  6>9))  , 

where  /i^/(x:  Gq)  is  the  predictive  mean  and  cr^(x;  6q)  the  predictive  variance 
for  r  =  1, . . . ,  M.  These  quantities  are  available  through  the  exact  same 
procedure  described  in  Section  2.1,  using  the  q  power  of  the  response  instead 
of  the  response  itself.  For  convenience,  let  us  write  the  predictive  mean  at  x 
as: 

N 

G'f?  (xi  0q)  =  <nC(x(n\  x;  Qq)  +  » obs.ro 

n= 1 

and  the  predictive  variance  at  x  as: 


2  , 
af?{ 


x;  ^)  = 


— 


^obs.r)' 


c(x,x;  dq)-cq’T{Cq)-1cq)  , 


where  /iobsr  and  o"obsr  are  defined  as  in  Eqs.  (8)  and  (9),  respectively,  using 
the  q  power  of  the  observed  response,  cq  =  (c(x^,  x;  6q ), . . . ,  c(x(JV\  x;  Gq)) 
and  Cq  is  the  covariance  matrix  evaluated  at  Gq. 

Our  goal  is  to  derive  a  predictive  probability  distribution  for  the  mo¬ 
ments  m9  given  the  data  and  the  hyper-parameters.  In  a  proper  probabilis¬ 
tic  treatment,  we  would  proceed  by  sampling  the  full  posterior  of  the  MGP, 
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integrating  the  samples  over  x  and  producing  a  Monte  Carlo  estimate  of  the 
predictive  mean  and  variance  of  each  moment.  To  obtain  analytic  estimates, 
let  us  make  the  simplifying  assumption  that  the  predictions  at  different  in¬ 
put  points  x  are  conditionally  independent  given  the  data  and  the  hyper¬ 
parameters.  Then,  by  the  additivity  of  independent  normal  variables,  we 
arrive  at  the  approximation: 

K I'D,  0q  y  ,  (19) 

where  the  predictive  mean  of  rnq  is: 

fe?=  f  /V?(x;09y(x)dx,  (20) 

./X' 

and  its  predictive  variance: 

=  f  Offfc&WWdx-  (21) 

Jx,: 

Fortunately,  the  integrals  involved  can  be  expressed  in  terms  of  expectations 
of  the  covariance  function  with  respect  to  the  conditional  input  distribution. 
This  results  in  a  fast,  semi-analytic  estimate  of  fimQ  and  crmQ.  It  is  worth 
mentioning  at  this  point  that  this  distribution  is  necessarily  wider  than  the 
optimum  one. 

Remark  1.  Obviously,  the  assumption  that  a  positive  function,  e.g.  the 
response  fr  raised  to  an  even  power,  is  a  Gaussian  Process  is  not  optimal, 
since  the  predictive  distribution  assigns  positive  probability  to  the  event  of 
the  function  getting  negative  values.  However,  this  assumption  is  necessary  in 
order  to  obtain  analytic  estimates  of  the  predictive  distribution  of  the  statis¬ 
tics.  A  direct  consequence  of  it  is  that  the  predictive  distribution  Eq.  (19) 
for  an  even  moment  has  also  positive  probability  of  being  negative.  A  tighter 
predictive  distribution  can  always  be  found  by  truncating  Eq.  (21)  below 
zero.  On  the  other  hand,  the  predictive  mean  of  an  even  moment  will  always 
be  positive. 

Evaluation  of  the  integrals.  We  now  proceed  to  the  calculation  of  the  integrals 
in  Eqs.  (20)  and  (21).  We  can  write  the  following: 

N 

hml  =  Y  arnen  +  ^  (22) 

n— 1 
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and 


(23) 


where 


N 

■.*  -  ^obs,A  l  L_  _  E(c%>»,  I . 


ai,  =  (o'.  V  I  r“ 


n,l= 1 


4  =  /  c(x(n),x;6>9y(x)dx, 


cq=  c(x,  x;  09)p*(x)dx, 


=  /  c(x,  x(n^;  09)c(x,  x(/);  9q)pl(x)dx, 


/X! 


(24) 

(25) 

(26) 


and  (C9)^1  is  the  nl  element  of  the  inverse  q  covariance  matrix  (C9)  h 
Thus,  computation  of  the  statistics  requires  the  evaluation  of  integrals  of 
the  form  of  Eqs.  (24),  (25)  and  (26).  In  Appendix  A,  we  provide  analytic 
formulas  for  their  calculation  for  the  special  case  of  uniform  input  distribution 
and  Squared  Exponential  (SE)  covariance  function.  For  the  SE  covariance 
function  but  arbitrary  input  probability  density  of  the  form  of  Eq.  (1),  their 
evaluation  requires  O(K)  one- dimensional  numerical  integrations. 


2.3.  From  local  to  global  statistics 

In  the  same  spirit  as  the  multi-element  methods  [6,  7,  31],  we  combine 
the  statistics  over  each  stochastic  element  in  order  to  obtain  their  global 
analogues.  Since  we  now  work  over  the  whole  domain,  we  will  explicitly  mark 
the  dependence  of  the  underlying  quantities  on  the  element  X* ,  i  =  1, ...  L 
Let  mq’1  be  the  q  moment  of  the  response  that  pertains  to  the  conditional 
probability  density  p*(x)  (Eq.  (17))  and  mq  be  the  global  one  (Eq.  (2)). 
Notice  that  rnq  can  be  decomposed  as 


m 


g 

r 


/r9(x)p(x)dx 

,  P(x 


£/  /. 

i=i  7x1 


P(X* 


^-dxP(Xi 


E  [  /rfl(x)pi(x)dxP(xi); 

i=i  7x1 
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mqr  =  ^mq/P{TC).  (27) 

i=  1 

Now,  assume  that  for  each  element  X® ,  i  =  1, ...  I  we  have  obtained  a  predic¬ 
tive  distribution  (Eq.  (19))  for  mq,t  and  let  its  predictive  mean  and  variance 
be  nmq,i  and  ( <rmq,i )2,  respectively  (Eqs.  (22)  and  (23)).  Assuming  condi¬ 
tional  independence  of  the  predictive  distributions  given  the  data  and  the 
hyper-parameters,  we  obtain  that: 


mqr\v,eq 


where  the  predictive  mean  is: 


i 

^)»r  —  ^  )’ 

2=1 


and  the  predictive  variance: 


(7 


2 

q 

mi- 


X 


P(X‘). 


(28) 


(29) 


(30) 


Again,  truncation  of  this  distribution  below  zero  for  even  q,  always  yields  an 
improved  estimator  (see  Remark  1). 

Finally,  we  derive  a  normal  approximation  to  the  predictive  distribution 
for  the  variance  of  the  response  v  =  (iq, . . . ,  %)  (defined  in  Eq.  (4)): 

vr  ~  N{iiVr,a2Vj)  .  (31) 


Under  the  assumption  of  conditional  independence  of  mq,  q  —  1,2,  the  pre¬ 
dictive  mean  of  vr  is  given  by: 

/V  :=  E  [m2r-(m1r)2\V,e1,02] 

=  E  [m2r  \V,  G\  e2}  -  E  [(ml)2 \V,  0\  02]  , 


or: 


2  2 

l-^Vr  Amp  Amp  V) 


(32) 
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where  E[-|X>,  O1,  62]  denotes  the  expectation  with  respect  to  the  joint  pre¬ 
dictive  distribution  for  m'  and  rnf .  Equivalently,  the  predictive  variance 
is: 

<?vr  '■=  V  [' mr  ~  (mr )2\v,  °2] 

=  v  [m2r  \v,  e\  e2]  +  v  [(ml)2  \v,  e\  e2] 

=  V  [ml \v, e\ e2}  +  E  [(ml)4 \v, e\ e2]  -  (e  [K)2|d, e\ e2])\ 


or: 


2  2 
°Vr  =  Gml 


T  4/i 


2  _2 
1<Ju1 

r  H'r 


+  Ki¬ 


rn 


where  V[-|Z>,  01,  02]  denotes  the  variance  with  respect  to  the  joint  predictive 
distribution  of  m)  and  m2. 

Let  us  end  this  section  by  mentioning  that  the  above  procedure  can  be 
easily  applied  to  obtain  normal  approximations  to  the  predictive  distributions 
of  any  centered  moment.  It  is  obvious  that  the  calculation  can  always  be 
casted  in  terms  of  moments  of  the  normal  distribution  which  are  readily 
available  using  the  confluent  hypergeometric  function  U (a,  b ,  x)  (see  Ch.  13 
of  [32]). 


2-4-  Adaptivity 

In  this  section,  we  develop  an  iterative  procedure  to  adaptively  decompose 
the  stochastic  space  in  smaller  elements.  The  initial  step  of  this  procedure 
starts  by  considering  a  single  element,  i.e.  X  itself.  Here,  we  assume  that  we 
are  already  given  a  decomposition  of  the  domain  as  well  as  a  local  surrogate 
model  on  each  element.  The  decision  we  wish  to  make  is  whether  or  not 
to  refine  a  given  element  and  in  which  way.  We  develop  refinement  criteria 
that  are  based  solely  on  information  gathered  by  the  current  surrogate  model 
and  no  further  calls  to  the  deterministic  solver  are  required.  The  Bayesian 
predictive  variance  Eq.  (15)  is  used  to  define  a  measure  of  our  uncertainty 
about  the  prediction  over  the  whole  domain  X.  We  show  how  this  measure 
can  be  broken  down  to  contributions  coming  from  each  element.  Based  on 
this  observation,  we  derive  a  criterion  that  suggests  refinement  of  an  element 
if  its  contribution  to  the  global  uncertainty  is  larger  than  a  pre-specihed 
threshold.  For  the  sake  of  simplicity,  we  only  consider  rectangular  elements 
and  refine  them  by  splitting  them  perpendicular  to  the  dimension  of  greatest 
importance  in  two  pieces  of  equal  probability.  The  importance  of  a  particular 
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dimension  is  characterized  by  its  length  scale.  The  length  scales  are  identified 
as  the  hyper-parameters  of  a  SE  covariance  function. 

Suppose  that  we  have  already  a  decomposition  of  the  stochastic  domain 
X  in  rectangular  elements  X*,  e.g. 

Xi  =  [a[,b\]x---x[aiK,biK}, 

with  a\  <  blk,k  =  1, . . . ,  K,  i  =  1 such  that  Eq.  (5)  holds.  Furthermore, 
assume  that  we  have  already  learnt  the  local  surrogates  on  each  element  X*. 
Let  cTp (x)  be  the  predictive  variance  of  the  r  =  1  output  of  the 

local  surrogate  of  f*  at  x  G  X*  (Eq.  (15)).  By  the  conditional  independence 
assumption  for  the  predictive  distribution  over  each  element  and  Eq.  (6),  the 
predictive  variance  of  the  r  =  1 , ,M  dimension  of  the  global  surrogate 
Ojr(x)  at  x  G  X  is  given  by: 

i 

°)r  (x)  =  a/‘  (x) 1 (x)  ■ ■  (34) 

i=  1 


Its  average  over  r, 


°f (x)  := 


is  a  measure  of  our  uncertainty  about  the  prediction  of  all  outputs  simulta¬ 
neously  at  the  test  point  x  G  X  .  Taking  the  expectation  of  this  quantity 
with  respect  to  the  input  probability  density  p(x),  we  obtain 


(35) 


This  quantity  is  a  measure  of  our  uncertainty  about  our  prediction  over  the 
whole  domain  X.  Notice  that,  in  cr|p,  the  uncertainty  of  the  model  at  x  is 
weighted  by  its  probability  of  occurrence  p(x).  Intuitively  speaking,  we  are 
willing  to  accept  a  somewhat  less  accurate  surrogate  in  regions  of  the  space 
occurring  with  lower  probability.  Using  Eq.  (34),  it  is  straightforward  to  see 
that: 

i 

<36> 

i=l 


where 


of  (x)p*(x)dx, 
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is  the  uncertainty  of  our  prediction  over  the  element  X*.  Making  use  of  Eq.  (21) 
for  q  —  1,  we  obtain  that: 


1 

M 


M 


y>\„ 

/  ^  mr 
r=  1 


(37) 


Hence,  crj?  pi  relates  directly  to  our  uncertainty  about  the  mean  response  cr2  1}i 
(Eq.  (23)).  Generalizing,  we  can  define  the  corresponding  uncertainties  for 
the  response  raised  to  the  q  >  1  power  (see  Section  2.3): 


<P:=5>i <38) 

i— 1 


where 


r= 1 


(39) 


This  measure  is  equivalent  to  our  uncertainty  about  the  q-th  moment  of  the 
response.  Our  idea  it  to  rehne  the  element  X*,  if  the  contribution  to  the 
global  uncertainty  coming  from  it,  is  greater  than  a  certain  threshold  6  >  0, 
that  is  we  rehne  X*  if: 


(Tf9  piP(xl)  >  5,  for  any  q  =  1,2,...,  (40) 

depending  on  how  many  moments  one  wishes  to  consider.  However,  in  the 
numerical  examples  of  the  present  work,  we  simply  use  the  criterion  for  q  —  1, 
despite  the  fact  that  we  report  also  the  variance.  We  plan  to  investigate  its 
dependence  on  q  in  a  later  work. 

The  above  criterion  specifies  whether  or  not  an  element  X*  should  be 
refined.  As  already  mentioned,  we  rehne  elements  by  cutting  them  in  equally 
probable  parts  perpendicular  to  ‘the  most  important  dimension’.  At  this 
point,  we  attempt  to  give  a  precise  meaning  to  the  concept  of  ‘the  most 
important  dimension’.  Towards  this  goal,  we  will  exploit  the  properties  of  a 
specihc  parametric  form  for  the  covariance  function,  the  Squared  Exponential 
(SE): 

cSe(x,x')  =  s2  exp  J2  - Xk  ^  j  ’  (41) 
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where  Sf  >  0  can  be  interpreted  as  the  signal  strength  and  £k  >  0  as  the 
length  scale  of  each  stochastic  input.  These  parameters  can  be  learnt  from 
the  data  by  using  the  evidence  approximation  (see  Section  2.1),  allowing  the 
determination  of  the  relative  importance  of  each  dimension.  The  technique  is 
called  automatic  relevance  determination  (ARD).  It  originated  in  the  Neural 
Networks  literature  [33]  and  was  later  extended  to  GP  Regression  [34],  We 
emphasize  that  a  unique  set  of  the  SE  hyper-parameters  is  learnt  on  each 
element  X*  (as  well  as  for  each  power  of  the  response,  f9,  that  we  take  into 
account).  Hence,  despite  the  fact  that  each  local  surrogate  is  a  stationary 
GP,  the  global  surrogate  is  non-stationary.  This  is  similar  in  spirit  to  the 
Bayesian  Treed  Gaussian  Process  Model  in  [17]. 

Let  us  explicitly  denote  the  learnt  length  scales  of  element  X*  correspond¬ 
ing  to  the  MGP  that  represents  f,  with  tk,  k  =  1, . . . ,  K .  The  length  scales  of 
the  powers  of  the  response,  f q,q  >  1,  are  not  involved  in  the  criterion  we  are 
about  to  formulate.  Furthermore,  let  us  introduce  the  probability  P k  that 
the  k-th  dimension  xk  of  a  random  input  point  x  G  X  falls  inside  X*: 

fbk 

pl  '■=  /  Pk{xk)dxk.  (42) 

In  general,  this  has  to  be  evaluated  numerically.  For  the  special  case  of 
uniform  distribution  on  a  rectangular  X,  we  obtain: 


We  define  the  importance  Pk  of  the  dimension  k  of  the  element  X*  to  be: 

II  =  PlK-  (43) 

Intuitively,  the  importance  of  a  particular  dimension  is  inversely  proportional 
to  the  inferred  length  scale  and  proportional  to  the  probability  mass  along 
that  dimension  trapped  within  the  stochastic  element.  Thus,  if  X*  needs 
refinement  (i.e.  satisfies  Eq.  (40)),  we  cut  it  perpendicular  to  the  most  im¬ 
portant  dimension  k *,  given  by: 

k*  =  arg  max  Pk.  (44) 

In  order  to  have  two  new  elements  with  the  same  probabilities  of  occur¬ 
rence,  the  splitting  point  is  given  by  the  median  of  the  marginal  conditional 
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distribution  of  X*  along  dimension  k,  plk(x *,)  defined  by: 


PkM  =  rbi  - l[albi]ixk)-  (45) 

fJ  PkK)dx'k 

k 

This  is  a  root  finding  problem  that  can  easily  be  solved  using  a  bisection 
algorithm.  For  the  special  case  of  the  uniform  distribution,  the  splitting 
point  trivially  becomes: 

xl  =  y(ak  +  K)- 

Remark  2.  The  particular  splitting  criterion  based  on  the  inferred  length 
scales  is  not  the  only  possibility.  Despite  being  intuitively  appealing,  it  re¬ 
mains  an  ad  hoc  choice.  Nevertheless,  its  computational  evaluation  time  is 
negligible  and  we  have  empirically  shown  that  it  results  in  decompositions 
that  concentrate  around  important  features  of  the  response.  Of  course,  its 
performance  depends  crucially  on  predicting  correctly  the  length  scales. 

2.5.  Collection  of  the  observations 

In  this  section,  we  discuss  how  the  data  within  an  element  are  collected. 
We  have  to  consider  two  distinct  cases: 

1.  No  data  have  been  observed  yet  and  we  only  have  a  single  element  (i.e. 
X  itself). 

2.  We  have  obtained  a  fit  of  the  response  over  an  element  X*  based  on  Nl 
observations 

V'  =  {(VI”), 

and  we  have  decided  to  split  it  in  two  elements  X1’1  and  X1’2  so  that 
X?:  =  Xu  U  X1’2  and  X*’1  n  X*’2  =  0. 

Let  N  >  1  be  the  maximum  number  of  observations  per  element  we  wish  to 
consider  within  each  element  and  5  >  0  be  the  desired  uncertainty  tolerance 
of  each  element  (see  Eq.  (40)).  We  deal  with  the  first  case  (no  observations 
made  so  far),  by  simply  observing  N  random  data  points  drawn  from  the 
input  probability  distribution  p(x).  In  the  second  case,  we  wish  to  utilize 
the  MGP  we  already  have  for  X',  in  order  to  make  the  most  informative 
selection  of  new  data  points.  This  procedure  is  known  in  the  literature  as 
Experimental  Design  (ED). 
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The  ED  problem  can  be  formulated  in  a  Bayesian  framework  in  terms 
of  maximizing  the  expectation  of  a  utility  function  (see  [35]  for  a  good  re¬ 
view  of  Bayesian  ED).  If  we  observe  the  data  points  one  by  one  and  update 
the  model  each  time,  then  the  procedure  is  termed  Sequential  Experimental 
Design  (SED).  In  the  machine  learning  literature  SED  is  known  as  Active 
Learning  (AL).  According  to  MacKay  [29],  if  the  utility  function  we  choose 
is  the  change  in  entropy  of  the  posterior  of  the  hyper-parameters  6,  then  - 
under  the  evidence  approximation  -  the  most  informative  input  point  corre¬ 
sponds  to  the  one  that  maximizes  the  predictive  variance  of  the  model.  This 
criterion  is  termed  Active  Learning  MacKay  (ALM).  An  alternative  to  ALM 
is  Cohn’s  criterion  (ALC)  [36],  which  proceeds  by  choosing  the  input  point 
that  maximizes  the  expected  change  in  output  predictive  variance  over  the 
whole  domain.  ALC  has  the  advantage  that  it  allows  one  to  weight  the  input 
space  by  a  probability  distribution,  which  in  our  setting  would  naturally  be 
the  input  probability  distribution  of  the  element  X*.  ALC  has  also  been  nu¬ 
merically  shown  to  perform  better  than  ALM  (for  a  comparison  of  ALM  and 
ALC  see  [37]  and  the  corresponding  discussion  in  [38]).  However,  ALC  is  not 
based  on  a  decision  theoretic  foundation  and  it  is  much  harder  to  implement. 
In  this  work  -  mainly  for  computational  purposes  -  we  choose  to  work  with 
ALM.  We  now,  describe  its  extension  to  the  multi-output  case. 

We  start,  by  splitting  the  observed  data  in  two  sets  V1,1,  l  —  1,2  according 
to  which  element  the  inputs  belong  to,  i.e. 

V'E  =  {(x,  y)  e  V*  :  x  e  XM},  1  =  1,  2. 

Let  6*  be  the  hyper-parameters  of  the  MGP  over  X*  and  ajr  (x;  6*)  be  the  cor¬ 
responding  predictive  variance  of  the  r-th  output  at  x  e  X'  given  by  Eq.  (15). 
Throughout  the  SED  procedure,  the  hyper-parameters  will  be  kept  constant. 
Without  loss  of  generality,  we  work  with  the  left  child  of  X*,  X*’1.  The  right 
child  is  treated  similarly.  We  will  be  sequentially  observing  xnew,m  and  the 
corresponding  responses  ynew’m  =  f(xnew,m)  for  m  =  1,2, . . ..  Let  the  set  of 
observations  residing  in  X*’1  be: 

VlXn  =  Vu  U  {Xnew’m  :  m  =  1, . . . ,  n},  n  >  1, 

where  IT’1,0  =  T>1’1.  Denote  by  cr^  (x;  9*,  the  predictive  variance  of  the 

r-th  output  when  T>l,1,n  is  taken  into  account.  From  Eq.  (15),  it  is  apparent 
that  cr(r(x:  6*,Vl'1,n)  depends  only  on  the  observed  input  points  and  not  on 
the  responses.  Furthermore,  since  6*  remains  constant,  the  inverse  covariance 
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matrix  can  be  estimated  sequentially  at  each  step  without  the  need  to  perform 
a  Cholesky  decomposition  (see  [21]).  The  extension  of  ALM  to  the  multi¬ 
output  case  is  as  follows:  given  ,  observe  the  input  point  xnew,n+1  G  X*’1 
that  maximizes  the  joint  uncertainty  of  all  outputs: 

1  M 

cr?(x;0*;  !>«’")  =  t7X'tI(x;9'- (46) 

r= 1 

That  is, 

xnew,n+l  =  &rg  max  ^2/  .  g* .  pt.l.nx  (47) 

xexo 

In  an  effort  to  introduce  a  bias  from  the  input  probability  distribution,  we 
suggest  using: 


xnew,n+l  =  arg  max  ^2/.  g*.  (48) 

xgX1-1 

which  causes  low  probability  regions  to  be  ignored.  Of  course,  for  the  uniform 
case  the  two  criteria  are  equivalent.  We  stop,  either  if  N  data  points  have 
been  collected  in  or  if: 

^(^’WX*’1)  <6,  (49) 

where  pM('ZT’1,n)P(X*’1)  is  the  expectation  of  of  (x;  6*;Vt,1,n)  with  respect 
to  the  conditional  probability  pl,1(x)  of  X*’1  (in  the  same  spirit  as  it  was  used 
in  Section  2.4). 

The  optimization  problem  in  Eq.  (48)  is  relatively  hard  and  involves  sev¬ 
eral  local  maxima.  Instead  of  solving  it  with  a  direct  method,  we  use  a 
simple  Monte  Carlo  procedure  to  obtain  an  approximate  solution.  We  draw 
NtextALM  random  samples  in  X*’1,  evaluate  the  product  of  the  predictive 
variances  and  the  input  probability  density  (Eq.  (46))  and  select  the  one 
yielding  the  greatest  result.  This  is  affordable,  since  (x;  0*;  is  cheap 

to  evaluate. 

2. 6.  A  complete  view  at  the  framework 

In  this  final  section,  we  put  together  the  building  blocks  of  our  scheme  and 
discuss  the  algorithmic  details  and  possible  parallelization  strategies.  The 
basic  input  required  is  the  maximum  number  of  observations  per  element 
N  and  the  tolerance  <5  >  0,  used  for  the  refinement  criterion  (Eq.  (40))  as 
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Algorithm  1  The  complete  surrogate  building  framework 

WM(X,0,0)}. 

C  <-  0. 

while  U  ^  0  do 

Remove  (X*,  IT,  AT)  from  U. 

if  AT  =  0  then 

Observe  N  random  points  drawn  from  p*(x)  Eq.  (18). 
else 

while  [Dl\  <  N  or  Eq.  (49)  not  satisfied  for  5  do 

Add  an  observation  to  T>1  using  the  ALM  procedure  (Eq.  (48)). 
Update  AT  to  take  into  account  the  new  data  in  V1. 

end  while 
end  if 

Refit  the  hyper-parameters  of  AT  using  only  the  data  in  T>‘  (Section  2.1). 

if  Refinement  criterion  of  Eq.  (40)  is  satisfied  for  5  then 
Split  X*  in  X*’1  and  X*’2  according  to  Eq.  (44). 

Let  D1’1  and  T>1’2  to  be  the  set  observations  residing  in  X1’1  and  X*’2, 
respectively. 

U^U  U  {(X*’1, 2T’\  AT),  (X^2,  V*'2,  AT)}, 
else 

C^Cu{(Xi,Vi,Mi)}. 

end  if 
end  while 


well  as  the  stopping  criterion  of  ALM  (Eq.  (49)).  An  additional  input  is  the 
number  of  MC  samples  used  to  approximate  the  solution  to  Eq.  (48)  (last 
paragraph  of  Section  2.5),  which  we  fix  to  IValm  =  10000. 

Our  scheme  works  in  one  element  cycles  that  comprise  of  collecting  ob¬ 
servations  (randomly  or  using  ALM  (Section  2.5)),  fitting  (Section  2.1)  and 
adapting  (Section  2.4).  Let  us  denote  with  X*  a  stochastic  element,  V  the 
observations  made  on  X*  and  AT  the  MGP  fitted  over  X*  using  V1.  Let  C 
be  the  set  of  triplets  (X*,IT,  AT)  for  which  the  refinement  criterion  Eq.  (40) 
is  not  satisfied.  We  will  refer  to  C  as  the  set  of  completed  triplets.  The  rest  of 
the  triplets  are  put  in  U ,  called  the  set  of  uncompleted  triplets.  With  \Vl\  we 
denote  the  number  of  observations  inside  D'.  Algorithm  1  provides  a  serial 
implementation  of  the  scheme. 
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Parallelization  of  Algorithm  1  is  relatively  easy.  Each  node  p,  has  its  own 
set  of  completed  Cp  and  uncompleted  Up  elements.  Initially  the  root  node 
p  =  0  starts  as  in  Algorithm  1  and  the  rest  with  Up  =  0,  Cv  =  0,  p  ^  0. 
Then,  everything  proceeds  as  in  Algorithm  1  with  load  re-balancing  at  the 
end  of  each  outer  iteration  (uncompleted  elements  are  sent  to  processors  with 
Up  =  0). 

Remark  3.  The  choice  of  the  maximum  number  of  samples  per  element  N 
is  a  crucial  parameter  to  the  scheme.  Its  optimal  value  depends  in  a  compli¬ 
cated  way  on  the  (a  priori  unknown)  smoothness  of  the  underlying  response 
as  well  as  the  number  of  hyper-parameters  S.  Its  importance  is  more  ev¬ 
ident  on  the  very  first  element  of  the  scheme  because  it  drives  the  rest  of 
the  tree  construction  as  well  as  the  Active  Learning  procedure.  If  a  small 
value  is  used,  then  local  features  may  be  lost,  while  a  very  big  value  may 
result  in  redundant  information.  Similar  problems  are  present  in  practically 
all  UQ  schemes.  For  example,  ME-gPC  depends  on  the  polynomial  degree 
and  ASGC  depends  on  which  level  of  the  Sparse  Grid  is  adaptivity  initiated. 
On  the  other  hand,  N  makes  our  method  computationally  tractable,  since 
it  bounds  above  the  dimensions  of  the  covariance  matrices  that  need  to  be 
inverted.  A  theoretical  analysis  of  the  optimal  value  of  N  is  highly  desirable, 
but  clearly  beyond  the  scope  of  the  present  work.  In  the  engineering  prob¬ 
lems  that  we  are  interested  in,  one  usually  already  has  a  rough  idea  about 
the  smoothness  of  the  problem  based  on  some  preliminary  simulations.  For 
smooth  problems,  using  N  «  2K ,  where  K  is  the  number  of  input  dimensions 
gives  satisfying  results  (see  the  Elliptic  and  Natural  Convection  numerical  ex¬ 
amples  in  Section  3).  For  problems  with  local  features,  a  slightly  bigger  value 
must  be  used.  Empirically,  we  fix  5  to  a  high  value  (e.g.  5  ~  1CT1),  we  start 
with  N  =  2K  and  increase  N  gradually  until  the  results  do  not  change  any 
more.  For  this  final  N,  we  decrease  5  and  resume  the  scheme. 

3.  Numerical  Examples 

All  examples  are  run  on  massively  parallel  computers  at  the  National  En¬ 
ergy  Research  Scientific  Computing  Center  (NERSCC).  The  parallelization 
strategy  is  straightforward:  each  processor  is  assigned  to  work  with  a  sin¬ 
gle  element.  The  communication  burden  between  the  processes  is  minimal. 
Our  implementation  utilizes  extensively  the  Trilinos  library  [39]  as  well  as 
GSL  [40], 
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The  ultimate  goal  of  the  numerical  examples  is  to  demonstrate  that  the 
method  can: 

1.  learn  non-stationary  surfaces, 

2.  deal  with  discontinuities, 

3.  identify  localized  features  of  the  response  and 

4.  reduce  sampling  frequency  on  unimportant  input  dimensions. 

Whenever  possible,  we  will  compare  onr  results  with  Sparse  Grid  Collocation 
(SGC)  and  Adaptive  Sparse  Grid  Collocation  (ASGC)  [13].  Each  method  will 
be  evaluated  by  considering  an  error  measure  of  the  predictive  surface  or  of 
the  statistics,  as  a  function  of  the  number  of  sample  points  used.  In  Sec¬ 
tion  3.1,  we  investigate  the  performance  of  our  method  in  learning  three 
synthetic  functions.  In  Sections  3.2,  3.3,  3.4,  we  apply  our  method  to  UQ 
problems.  In  all  problems,  the  underlying  input  probability  distribution  p(x) 
is  understood  to  be  the  uniform  distribution  over  the  input  domain.  The  co- 
variance  function  we  use,  is  the  SE  with  a  nugget  g2  =  10-6  (The  nugget  is 
required  for  numerical  stability.  See  the  discussion  in  Appendix  A  for  more 
details.).  All  tasks  start  with  a  single  element  (the  input  domain  itself)  and 
N  random  samples  drawn  from  the  input  distribution.  N ,  is  also  the  max¬ 
imum  number  of  samples  taken  within  an  element  and  is  different  for  each 
example  (See  Remark  3  for  to  see  how  N  can  be  chosen).  From  that  point, 
the  algorithm  proceeds  until  a  pre-specified  tolerance  6  >  0  is  reached.  The 
refinement  criterion  is  given  by  Eq.  (40)  for  q  =  1.  The  same  tolerance  5  is 
used  to  stop  the  ALM  procedure  of  Section  2.5  (see  Eq.  (49)).  The  solution 
to  the  optimization  problem  of  ALM  (Eq.  (48))  is  approximated  by  drawing 
AAlm  =  10000  samples  in  X®,  evaluating  a? (x;  0*;  T>l,1,n)  and  selecting  the 
one  with  the  maximum  value.  The  parameters  of  the  method  are  g,  N,  AAlm 
and  S. 

3.1.  Simple  Validation  Examples 

The  purpose  of  this  section  is  to  demonstrate  using  simple,  single-output 
functions,  the  claims  1  —  4  made  at  the  beginning  of  this  section.  The  three 
synthetic  functions  we  are  going  to  use  have  been  introduced  in  [38].  The  per¬ 
formance  of  each  run  is  evaluated  by  comparing  the  predictive  mean  /i/r(x) 
to  the  true  response.  The  error  measure  of  choice  here  is  the  Mean  Square 
Error  (MSE)  of  S'  =  105  random  samples  drawn  fromp(x).  Specifically,  MSE 
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is  defined  to  be 


1  S  M 

MSE(Wr(-))  :=  (W-(*W)  -  /r(x(<)))2,  (50) 

s=l  r= 1 

where  xP\  s  =  1, . . . ,  S'  are  random  samples  from  p(x).  Those  samples  were 
not  used  in  the  fitting  procedure,  hence  MSE  is  a  measure  of  the  predictive 
capabilities  of  the  regression  method. 

ID  non- stationary,  discontinuous  function.  Consider  the  real  function: 


h(x) 


sin(f)  +  fcos(^)  ,  x  <10 
-jj;  —  1,  otherwise 


(51) 


on  the  domain  X  =  [0,20].  For  x  <  10,  it  varies  with  two  different  frequen¬ 
cies.  For  x  >  10  it  is  linear  and  finally  it  has  a  discontinuity  at  x  =  10. 

We  learn  this  function  with  our  framework  using  N  =  5  until  various  toler¬ 
ances  are  reached.  Fig.  1  compares  the  MSE  of  MGP  with  ASGC  for  various 
numbers  of  observations.  The  observations  shown  for  MGP  correspond  to 
tolerances  of  <5  =  ICC1, ICC2, ICC5, ICC6, ICC7  and  ICC8.  The  e  parameter  of 
ASGC  (see  [13])  is  a  lower  bound  of  the  sparse  grid  surpluses.  The  bigger  e  is, 
the  more  samples  ASGC  skips.  As  e  goes  to  zero,  the  ASGC  approaches  SGC. 
For  large  values  of  e  though,  ASGC  fails  to  converge.  Hence,  e  determines  the 
balance  between  exploration  and  exploitation  is  ASGC.  It  is  apparent  that 
ASGC  is  out-performed  by  MGP  by  almost  an  order  of  magnitude.  Fig.  2 
plots  the  predictive  mean  /if1  (x)  with  95%  error  bars  for  5  =  ICC2, 10“4  and 
ICC6  along  with  the  true  response  fi(x)  where  the  symbols  mark  the  position 
of  the  observed  data  (left  column).  Notice,  that  the  linear  part  is  already 
captured  at  5  =  ICC2  (13  observations)  and  that  the  region  x  >  10  is  not 
sampled  any  further.  Another  important  observation  is  that  the  error  bars 
are  maximized  in  regions  of  space  where  the  true  error  is  bigger.  This  fact  is 
the  empirical  justification  of  their  usage  in  the  SED  framework  of  Section  2.5. 
As  the  lower  levels  of  tolerance  are  reached,  more  and  more  samples  are  col¬ 
lected  inside  the  important  regions  and  the  discontinuity  is  finally  resolved. 
The  right  column  of  the  same  figure,  plots  the  value  of  the  inferred  length 
scale  £  as  a  function  of  x  (one  length  scale  at  each  element).  The  linear  region 
is  treated  as  a  single  element  with  a  large  length  scale,  while  the  rest  of  the 
domain  is  fragmented  in  smaller  elements  with  small  length  scales. 
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Figure  1:  The  MSE  in  the  prediction  of  fi(x)  as  a  function  of  the  observed  samples  for 
MGP  and  ASGC  for  various  e. 

2D  function  with  local  features.  Let  us  now  consider  a  two-dimensional  real 
function: 

f2(x1,x2)  =  aqexp  {—x\  -  x\)  ,  (52) 

on  X  =  [—2,  6]2.  This  function  is  peculiar,  in  the  sense  that  it  has  two  local¬ 
ized  features  inside  the  box  [ — 2,  2]2,  while  it  is  practically  zero  everywhere 
else.  The  choice  of  N  in  this  example  plays  an  important  role  since  it  deter¬ 
mines  the  starting  point  of  our  algorithm.  We  have  numerically  verified  that 
for  N  =  5  there  is  a  high  probability  of  not  observing  the  localized  features. 
For  N  =  10  and  20  the  features  are  observed,  albeit  after  a  few  fluctuations 
which  result  in  a  higher  number  of  observations  being  made.  ASGC  starting 
from  Level  1  (any  e)  fails  to  correctly  identify  the  location  of  the  localized 
features,  since  it  does  not  sample  inside  [— 2,2]2.  On  the  other  hand,  SGC 
requires  a  very  large  number  of  observations.  Here,  we  choose  to  report  our 
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Figure  2:  Left  column  (a,  c,  e):  comparison  of  the  predictive  mean  /if1  (x)  (dashed  blue 
line)  and  95%  error  bars  (shaded  grey  area)  with  true  response  fi{x)  (solid  red  line),  where 
the  symbols  mark  the  observed  data.  Right  column  (6,  d,  /):  predicted  length  scale  across 
the  domain.  The  rows  correspond  to  tolerances  5  =  10~2, 10~4  and  10-6  with  number  of 
samples  gathered  13,  25  and  94,  respectively. 
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results  for  N  =  50.  Fig.  3  plots  the  MSE  for  MGP  and  SGC  as  a  function 
of  the  number  of  observations.  The  MSE  of  ASGC  is  not  reported  since  it 
fails  to  identify  the  localized  features  when  it  starts  from  Level  1.  The  ob¬ 
servations  shown  for  MGP  correspond  to  tolerances  of  5  =  10  3, ICG4, 10-5 
and  ICG6.  As  expected,  SGC  is  out-performed  by  more  than  two  orders  of 
magnitude.  Fig.  4  shows  the  contour  of  the  predictive  mean  fif2(x  i,x2)  for 
tolerances  6  =  ICG4, ICG5  and  ICG6  (right  column)  along  with  the  decompo¬ 
sition  of  the  stochastic  domain.  The  left  column  depicts  the  corresponding 
observed  input  points  (left  column).  Notice  how  the  density  of  the  observa¬ 
tions  is  increasing  in  the  important  regions  as  lower  As  are  reached,  gradually 
revealing  the  local  features. 


Figure  3:  The  MSE  in  the  prediction  of  /Ax)  as  a  function  of  the  observed  samples  for 
MGP  and  SGC.  ASGC  (e  =  1(G3)  is  not  reported  since  it  fails  to  identify  the  localized 
features. 


6D  function  with  unimportant  dimensions.  Finally,  we  consider  the  six- dimensional 
real  function: 

fz{x  1,  x2,  x3,  x4,  x5,  x6)  =  exp  {sin  ((0.9(xi  +  0.48)10)  }  +  x2x3  +  x4 ,  (53) 
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on  the  hypercube  X  =  [0,  l]6.  varies  wildly  as  a  function  of  x\  (see  (a)  of 
Fig.  5),  it  is  linear  in  x 4,  quadratic  with  respect  to  X2  and  X3  and  constant 
for  X5  and  x§.  We  learn  it  using  our  scheme  with  N  =  10.  Fig.  6  plots  the 
MSE  for  MGP,  SGC  and  ASGC  as  a  function  of  the  number  of  observations. 
ASGC  is  out-performed  by  at  least  an  order  of  magnitude.  In  Fig.  5,  we 
analyze  the  distribution  of  the  observed  input  points.  In  particular,  we  plot 
the  histogram  of  the  projection  of  the  observed  inputs  on  X\  (b),  x§  (c)  and 
Xq  (d)  axes.  Notice  that  MGP  increases  the  sampling  density  in  important 
regions  with  respect  to  X\  while  x 5  and  x§  are  sampled  uniformly.  The 
histograms  for  £2,^3  and  X4  are  similar  to  x5  and  xe. 
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Figure  4:  Left  column  (a,  c,  e):  predictive  mean  for  /2(x)  of  the  MGP  (N  =  50)  and 
decomposition  of  the  domain  for  S  =  10-4, 10-5  and  10-6  (top  to  bottom).  Right  column 
(b,  d1  /):  observations  made  for  the  same  6. 
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Figure  5:  (a)  The  additive  part  of  fs(x i, . . .  ,x§)  that  depends  on  X\\  (b),  (c)  and  (d)  are 
histograms  of  the  projections  of  the  observed  inputs  on  the  X\ .  x$  and  Xq  axes,  respectively. 
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Figure  6:  The  MSE  in  the  prediction  of  /s(x)  as  a  function  of  the  observed  samples  for 
MGP,  SGC  and  ASGC  (e  =  1CT3). 
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3.2.  Krainchnan-Orszag  three-mode  problem 

Consider  the  system  of  ordinary  differential  equations  [6]: 

dyi 

-7 r  =  2/12/3, 


dy 2 
dt 
dy  3 

dt 


-yiyz, 

-y\  +  2/2, 


subject  to  random  initial  conditions  at  t  =  0.  This  dynamical  system  is 
particularly  interesting  because  the  response  has  a  discontinuity  at  the  planes 
|/i  (0)  =  0,  1/2(0)  =  0.  The  deterministic  solver  we  use  is  a  4th  order  Runge- 
Kutta  method  as  implemented  in  GNU  Scientific  Library  [40].  We  solve 
the  system  for  the  time  interval  [0, 10]  and  record  the  response  at  time  step 
intervals  of  At  =  0.01.  This  results  in  a  total  of  M  —  300  outputs  (100  for 
each  of  the  three  dimensions  of  the  response).  We  will  consider  three  different 
cases  of  increasing  difficulty  with  one,  two  and  three  input  dimensions.  The 
results  we  obtain  will  be  compared  to  a  MC  estimate  with  106  samples.  Let 
the  MC  mean  and  variance  be  mr> mc  and  rr,MC)  respectively,  r  —  1, . . . ,  300. 
The  error  of  the  statistics  will  be  evaluated  using  the  (normalized)  L2  norm 
of  the  error  in  variance  defined  by: 


El2 


1 

M 


M 

(b,MC  —  hvr)2  , 

r= 1 


(54) 


where  /i?v  is  the  predictive  mean  of  vr  (Eq.  (32)).  The  results  are  compared 
with  SGC  and  ASGC. 


One- dimensional  Problem.  In  the  one- dimensional  case,  we  define  the  stochas¬ 
tic  initial  conditions  by: 

2/i(0)  =  1,  1/2(0)  =  O.lx,  2/3(0)  =  0, 


where 

x~U([- 1,1]), 


with  U ([ — 1, 1])  being  the  uniform  probability  distribution  over  [—1, 1],  This 
stochastic  problem  has  a  discontinuity  at  x  —  0.  We  solve  it  for  N  =  5. 
Fig.  7  shows  the  L2  norm  of  the  error  in  variance  for  MGP,  SGC  and  ASGC 
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as  a  function  of  the  number  of  observations.  ASGC  for  e  =  10-1  fails  to 
converge  and  so  it  is  not  reported.  MGP  slightly  out-performs  ASGC  and 
SGC,  especially  when  just  a  few  samples  are  used.  Fig.  8  depicts  the  pre¬ 
diction  of  7/2  (t  =  10)  and  y3  (t  =  10)  at  levels  of  tolerance  5  =  10-3, 10”5 
and  10-7.  Again,  we  observe  that  the  error  bars  are  qualitatively  equivalent 
to  the  true  error.  Notice  how  the  discontinuity  is  gradually  resolved.  Fig.  9 
plots  the  predictive  mean  and  variance  of  y3  (t)  as  a  function  of  time  t  along 
with  95%  error  bars  (see  Eqs.  (28)  and  (31))  and  compares  them  with  the 
MC  predictions.  The  error  bars  of  the  statistics  are  qualitatively  correct 
but  -  as  expected  by  the  independence  assumption  (Section  2.2)  -  they  are 
over-estimated.  This  situation  is  more  pronounced  in  the  predictions  for  the 
statistics  of  the  two  and  three  dimensional  problems. 


Figure  7:  KO-1:  the  L2  norm  of  the  error  in  variance  as  a  function  of  the  observed  samples 
for  MGP,  SGC  and  ASGC. 
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Two-dimensional  Problem.  For  the  two-dimensional  problem,  the  stochastic 
initial  conditions  are  defined  by: 

2/i(0)  =  1,  1/2(0)  =  O.Lri,  //3(0)  =  *2, 

where 

Xi  ~  £/([— 1, 1]),  i  =  1,2. 

This  problem  has  a  line  discontinuity  at  *1  =  0.  We  run  the  MGP  framework 
for  IV  =  10.  Fig.  10  shows  the  L2  norm  of  the  error  in  variance  for  MGP,  SGC 
and  ASGC  as  a  function  of  the  number  of  observations.  At  this  example, 
the  performance  of  MGP  and  ASGC  (e  =  10-2)  is  approximately  the  same. 
Fig.  11  depicts  the  prediction  at  2/3  ( t  =  10)  along  with  the  stochastic  elements 
at  levels  of  tolerance  5  =  10~3,  10^5  and  10-7.  As  a  lower  tolerance  is 
reached,  the  stochastic  mesh  adapts  around  the  discontinuity  increasing  the 
sampling  density.  Fig.  12  plots  the  predictive  mean  and  variance  of  2/3%) 
as  a  function  of  time  t  along  with  95%  error  bars  and  compares  it  with  the 
MC  prediction.  Again,  we  notice  that  the  error  bars  are  over-estimated. 
Finally,  by  using  104  samples  of  the  surrogate,  we  provide  a  kernel  density 
approximation  to  the  probability  density  function  (PDF)  of  y2  (t  =  10)  and 
1/3  (f  =  10)  and  compare  it  to  an  MC  estimate  with  the  same  number  of 
samples  (Fig.  13). 

Three-dimensional  Problem.  The  three-dimensional  problem  is  defined  to 
have  initial  conditions: 

2/i(0)  =  *1,  2/2(0)  =  *2,  2/3(0)  =  *3, 


where 

Xi~U([-l,l]),i  =  1,2,3. 

We  run  our  framework  for  iV  =  20.  Fig.  14  shows  the  L2  norm  of  the 
error  in  variance  for  MGP,  SGC  and  ASGC  as  a  function  of  the  number  of 
observations.  ASGC  with  e  =  10”1  fails  to  converge.  MGP  out-performs 
ASGC.  Fig.  15  plots  the  predictive  mean  and  variance  of  2/3%)  as  a  function 
of  time  t  along  with  95%  error  bars  and  compares  it  with  the  MC  prediction. 
Finally,  Fig.  16  plots  the  kernel  density  estimate  of  the  PDF  of  y2  ( t  =  10) 
and  2/3  (t  =  10)  using  104  samples  of  the  surrogate. 
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N=5,  8=1  e-3 


N=5,  8=1  e-3 


Figure  8:  KO-1:  prediction  (dashed  blue)  with  95%  error  bounds  for  tolerances  (top  to 
bottom)  5  =  10-3, 10-5  and  10~7  versus  the  true  response  (solid  red)  for  y2  (t  =  10)  (left 
column,  a,  c,  e)  and  y$  (t  =  10)  (right  column,  b,  d,  /). 
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N=5,  8=1  e-3 


Time  (t) 

(a) 


N=5,  8=1  e-5 


Time  (t) 

(c) 


N=5,  8=1  e-7 


Time  (t) 

(e) 


N=5,  8=1  e-3 


(b) 


N=5,  8=1  e-5 


Time  (t) 

(d) 


N=5,  8=1  e-7 


(f) 


Figure  9:  KO-1:  predictive  mean  (dashed  blue)  versus  MC  estimate  (solid  red)  of  the 
mean  (left  column,  a,  c,  e)  and  variance  (right  column,  6,  d,  /)  of  ys  (t)  with  95%  error 
bounds  for  tolerances  (top  to  bottom)  S  =  10-3,  10-5  and  10-7. 
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norm 


Figure  10:  KO-2:  the  L2  norm  of  the  error  in  variance  as  a  function  of  the  observed 
samples  for  MGP,  SGC  and  ASGC. 
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Figure  11:  KO-2:  The  prediction  at  y 3  (t  =  10)  with  the  stochastic  elements  (left  column, 
a,  c,  e)  and  the  observed  samples  (right  column,  b,  d ,  /)  for  tolerances  (top  to  bottom) 
S  =  10~3,  10~5  and  10"7. 
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N=10,  6=1  e-4  N=10,  5=1  e-4 


N=10,  8=1  e-6  N=10,  5=1  e-6 


N=10,  8=1  e-8  N=10,  8=1  e-8 


(e) 


(f) 


Figure  12:  KO-2:  predictive  mean  (dashed  blue)  versus  MC  estimate  (solid  red)  of  the 
mean  (left  column,  a,  c,  e)  and  variance  (right  column,  6,  d,  /)  of  ys  (t)  with  95%  error 
bars  for  tolerances  (top  to  bottom)  8  =  10-4, 10-6  and  10-8. 
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(a) 


(b) 


Figure  13:  KO-2:  kernel  density  estimation  of  the  PDF  of  1/2  (t  =  10)  (left)  and  2/3  (t  =  10) 
(right)  using  105  samples. 


Figure  14:  KO-3:  the  L2  norm  of  the  error  in  variance  as  a  function  of  the  observed 
samples  for  MGP,  SGC  and  ASGC. 
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N=20,  5=1  e-4 


(a) 


N=20,  5=1  e-6 


(c) 


N=20,  5=1  e-8 


N=20,  5=1  e-4 


(b) 


N=20,  5=1  e-6 


(d) 


N=20,  5=1  e-8 


(e) 


(f) 


Figure  15:  KO-3:  predictive  mean  (dashed  blue)  versus  the  MC  estimate  (solid  red)  of  the 
variance  of  y\(t)  (left  column,  a,  c,  e)  and  y2(t)  (right  column,  b,  d,  /)  with  95%  error 
bars  for  tolerances  (top  to  bottom)  S  =  10”4, 10-6  and  10-8. 


42 


(a)  (b) 


Figure  16:  KO-3:  kernel  density  estimation  of  the  PDF  of  2/2  (t  =  10)  (left)  and  2/3  {t  =  10) 
(right)  using  105  samples. 
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3.3.  Elliptic  Problem 

In  this  section,  we  consider  a  simple  stochastic  elliptic  problem  [41].  Con¬ 
sider  the  stochastic  partial  differential  equation  (SPDE): 

-V  •  (aK(v,  -)Vu(u,  •))  =  /(■),  inA 
u(u>,  •)  =  0,  on  dD, 

where  the  physical  domain  is  D  =  [0,  l]2.  In  order  to  avoid  confusion  with 
the  physical  dimension  x,  we  have  chosen  to  denote  the  random  variables 
with  u)  instead  of  x.  We  choose  a  smooth  deterministic  load: 


f(x,y)  =  100cos(x)  sia(y), 


and  work  with  homogeneous  boundary  conditions.  The  deterministic  prob¬ 
lem  is  solved  with  the  finite  element  method  using  400  (20  x  20  grid)  bilinear 
quadrilateral  elements.  The  random  diffusion  coefficient  ax( u,x)  is  con¬ 
structed  to  have  a  one-dimensional  dependence: 


log (aK(u>,x,y)  -  0.5) 


K 


1  +  UJ\ 


f  ik4>k(x)uk, 


V 


k= 2 


(55) 


where 

6c  :=  (VnL)1/2  exp  ^ j  ,  for  k  >  2, 

and 

sin  ^  'L2^d  j  ,  if  k  is  even, 

cos  ,  if  k  is  odd, 

L-J  being  the  integer  part  of  real  number.  We  choose  the  u>k,  k  —  1, . . , ,  K  to 
be  independent  identically  distributed  random  variables: 

uk  ~  U([—V 3,  V3]). 

Hence,  the  stochastic  input  space  is  f2  =  [ — \/3,  V3]K .  Finally,  we  set: 

Lp  =  max{l,  2LC}  and  L  —  — , 

Lp 
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where  Lc  is  called  the  correlation  length.  The  expansion  Eq.  (55)  resem¬ 
bles  the  Karhunen-Loeve  expansion  of  a  two-dimensional  random  held  with 
stationary  covariance 

Cov[log(aA'  -  0.5)]((xi,j/i),  (x2,y2))  =  exp  . 

In  this  study,  we  set  the  correlation  length  to  Lc  =  0.6  and  test  the 
convergence  of  our  method  for  K  =  10,  20  and  40  input  dimensions.  The 
results  for  K  =  10, 20  and  40  are  evaluated  by  calculating  the  L2  error 
in  variance  (Eq.  (54))  using  a  plain  MC  estimate  with  106  samples.  The 
performance  is  compared  to  ASGC  for  various  e.  The  K  =  10,  20  and  40 
cases  are  solved  using  N  =  20,  40  and  80  up  to  a  tolerance  of  10”7,  10~5 
and  10~4,  respectively.  Figures  17,  18  and  19  show  the  L2  error  in  variance  for 
each  case.  In  all  cases  MGP  outperforms  ASGC,  especially  when  the  number 
of  samples  is  small.  The  error  curves  in  Fig.  18  become  asymptotically  hat 
for  all  methods  (MGP  and  ASGC)  as  a  result  of  the  MC  accuracy  being 
reached.  Fig.  20  shows  the  convergence  of  the  prediction  for  the  variance 
of  MGP  as  the  tolerance  threshold  is  lowered  to  5  =  10-7.  Subhgure  (e)  of 
the  same  figure,  plots  the  uncertainty  of  the  variance  (Eq.  (33))  at  that 
tolerance.  As  already  observed  in  previous  examples,  a%r  over-estimates  the 
true  error.  Fig.  21  tests  the  predictive  capabilities  of  MGP  for  K  =  10  at  a 
tolerance  5  =  10~6  on  a  random  input  point.  We  notice  a  good  agreement 
with  the  true  response. 
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Figure  17:  Elliptic,  K  =  10:  The  L 2  norm  of  the  error  in  variance  of  the  elliptic  problem 
with  I\  =  10  inputs  as  a  function  of  the  observed  samples  for  MGP,  SGC  and  ASGC. 
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Figure  18:  Elliptic,  K  =  20:  The  L-2  norm  of  the  error  in  variance  of  the  elliptic  problem 
with  I\  =  20  inputs  as  a  function  of  the  observed  samples  for  MGP  and  ASGC. 
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Figure  19:  Elliptic,  K  =  40:  The  L2  norm  of  the  error  in  variance  of  the  elliptic  problem 
with  K  =  40  inputs  as  a  function  of  the  observed  samples  for  MGP  and  ASGC. 
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MGP,  N=20,8=10_1:  Variance  MGP,  N=20,8=10“3:  Variance 


Figure  20:  Elliptic,  I\  =  10:  Convergence  of  the  predicted  variance  as  the  tolerance 
decreases.  Subfigure  (f)  refers  to  MC  results  and  subfigure  (e)  shows  the  uncertainty 
associated  with  the  predicted  variance  a%r  at  5  =  10~7. 
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MGP,  N=20:  Prediction  True  response 


(c) 


(d) 


Figure  21:  Elliptic,  K  =  10,  8  =  10-6:  Comparing  the  prediction  (a)  at  a  random  input 
point  (d)  with  the  true  response  (b).  Subfigure  (c)  shows  the  corresponding  predictive 
variance. 
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3-4-  Natural  Convection  Problem 

Consider  the  dimensionless  form  of  the  Oberbeck-Boussinesq  approxima¬ 
tion  using  the  vorticity  transport  equation  in  stream-function  formulation: 


<9  2  ,  dp)  d  2  /  ,  dp)  d  2  / 

V  p)  -  x-T-V  Ip  +  T^TpV  p) 
at  dy  ox  ox  dy 

dT  dp)  dT  dp)  dT 

dt  dy  dx  dx  dy 


=  —  Pr  X7Ap)  +  Ra  Pr 


=  V2T, 


dT 
dx  1 


where  Pr  and  Ra  are  the  Prandtl  and  Rayleigh  numbers,  respectively.  In  this 
formulation,  the  velocity  field  is  given  by: 


dp) 


(56) 


We  solve  the  problem  in  a  two-dimensional  square  cavity  X  =  [0,  l]2.  We 
impose  no  slip  conditions  to  the  boundary: 


u(x,y )  =  0,  v(x,y)  =  0,  for  (x,y)  G  dX. 


The  two  horizontal  walls  are  considered  adiabatic: 

9T fe V i  =  0,  for  0  <  x  <  1,  y  =  0, 1. 
dy 

The  right  vertical  wall  (hot)  is  kept  at  a  constant  temperature: 

T(l,y)  =  0.5,  for  0<y<l. 

The  left  vertical  wall  (cold)  is  taken  to  be  a  one- dimensional  Gaussian  stochas¬ 
tic  process  with  mean  —0.5  and  exponential  covariance 

Cov  [x\ ,  X2]  =  s2  exp  I  —  -  1 

l  Lc 

where  s 2  is  the  variance  of  the  signal  and  Lc  the  correlation  length.  Using 
the  Karhunen-Loeve  (KL)  expansion,  we  may  write 

OO 

T( 0,  y;  w)  =  -0.5  +  ^  ~k(pk{y)F~1{uk), 

k=  1 
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where  A/c  and  (f>k(y)  are  the  eigenvalues  and  eigenvectors  of  the  covariance 
function  and  F~x  is  the  inverse  cumulative  distribution  function  of  J\f( 0, 1) 
and  Uk  are  independent  uniform  random  variables  in  [0, 1].  It  is  noted  here 
that  Afc  and  4>k{y )  are  analytically  available  [42], 

In  this  study,  we  set  Lc  =  1  and  keep  only  K  =  4  or  8  terms  in  the  KL 
expansion.  The  parameters  we  use  are  Pr  =  1  and  Ra  =  5000.  The  deter¬ 
ministic  problem  is  solved  using  the  Nektar  fluid  dynamics  code  [43],  which 
utilizes  spectral  elements.  The  domain  was  decomposed  in  240  quadrilateral 
elements  (12  x  12  grid)  and  4  spectral  modes  were  used  on  each  one.  It  has 
been  numerically  verified  that  no  more  modes  were  necessary  for  convergence 
of  the  spectral  elements.  The  output  is  observed  at  16  (4  x  4  grid)  equidis¬ 
tant  mesh  points  on  each  element.  This  results  in  a  total  of  2401  outputs  for 
each  of  the  physical  quantities  of  interest  (T,u,v  and  the  pressure  p).  The 
total  number  of  output  dimensions  is  thus  M  =  9604.  For  computational 
convenience,  we  only  work  with  temperature  T  and  the  u  component  of  the 
velocity,  a  total  of  4802  output  parameters.  For  K  =  2  and  4,  we  run  our 
scheme  until  a  tolerance  6  =  10~5  is  reached  with  IV  =  10.  A  total  of  1393 
and  14396  observations  were  made,  respectively.  For  K  =  8,  we  reach  a 
tolerance  of  A  =  10~3  which  results  in  829  observations  being  made.  Fig.  22 
compares  the  predicted  standard  deviations  (std.)  of  u  (top)  and  T  (bottom) 
for  K  =  8  with  MC  estimates  using  80,  000  samples.  The  results  are  in  good 
agreement  with  the  MC  estimates.  In  Figs.  23,  we  draw  a  random  sample 
from  the  input  distribution  for  the  K  =  4  case.  We  present  the  predictive 
mean  of  T  along  with  two  std.’s  and  compare  it  to  the  absolute  error.  No¬ 
tice  that  the  two  std.’s  are  qualitatively  similar  to  the  absolute  error  of  the 
prediction. 
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(a)  K=8,  MGP  {N  =  20 ,5  =  10"3):  (b)  K=8,  MC:  std.  of  u 

std.  of  u 


(c)  K=8,  MGP  (N  =  20 ,6  =  10"3):  (d)  K=8,  MC:  std.  of  T 

std.  of  T 


Figure  22:  Natural  Convection:  MGP  prediction  at  tolerance  level  5  =  10^3  for  the 
standard  deviation  of  the  velocity  u  (top)  and  temperature  T  (bottom)  compared  to  a 
MC  estimate  for  K  =  8  input  dimensions. 
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(a)  Mean  T  prediction 
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(c)  2  std.’s  of  T 

Figure  23:  Natural  Convection  ( K  =  4, 8  = 
input  point  with  the  true  response. 


(b)  Absolute  error 
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5):  Comparing  the  prediction  at  a  random 
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4.  Conclusions 


We  have  developed  a  novel,  non-intrusive  Bayesian  scheme  based  on  a 
treed  multi-output  GP  model  that  can  be  used  in  UQ  tasks.  The  tree  is  built 
in  a  sequential  way  that  utilizes  information  contained  only  in  the  data  ob¬ 
served  so  far.  Tree  refinement  depends  on  the  observations  through  a  global 
measure  of  the  uncertainty  in  the  prediction,  the  inferred  length  scales  as 
well  as  the  input  probability  distribution.  A  Sequential  Experimental  Design 
technique  based  on  the  predictive  uncertainty  was  also  used  to  adaptively 
select  the  most  informative  input  points  on  each  element.  The  final  result  is 
a  non-stationary,  predictive  distribution  for  the  response  of  the  underlying 
system,  that  can  be  semi-analytically  integrated  to  provide  point  estimates 
and  error  bars  for  the  statistics  of  interest.  We  have  numerically  demon¬ 
strated  that  the  framework  can  (1)  capture  non-stationary  responses,  (2) 
locate  discontinuities,  (3)  identify  localized  features  and  (4)  reduce  the  sam¬ 
pling  frequency  on  unimportant  input  dimensions.  The  method  was  shown  to 
outperform  SGC  and  ASGC  in  almost  all  numerical  examples  investigated, 
especially  when  only  a  small  number  of  observations  were  used. 

The  presented  framework  is  particularly  interesting,  in  that  it  can  be  ex¬ 
tended  in  several  ways  that  can  improve  its  performance  dramatically.  From 
a  technical  point  of  view  several  aspects  require  further  numerical  investiga¬ 
tion:  e.g.  the  dependence  of  the  result  on  the  choice  of  the  maximum  number 
of  samples  per  element  N,  the  performance  of  the  ALC  experimental  design 
technique  instead  of  the  ALM  scheme  used  in  the  current  work,  the  depen¬ 
dence  of  the  final  decomposition  of  the  stochastic  space  on  the  refinement 
criterion  Eq.  (40)  for  g  ^  1  and  so  on.  Another  important  development  would 
be  to  replace  the  current  multi-output  GP  model  with  a  GP  model  that  ex¬ 
plicitly  takes  into  account  correlation  between  the  outputs.  Such  an  effort,  is 
expected  to  reduce  the  number  of  samples  required  significantly.  Currently, 
the  GPs  learnt  on  each  element  are  dropped  if  the  element  is  split  in  half. 
The  result  is  that  each  element  is  treated  independently  and  the  response  is 
not  smooth  along  the  element  boundaries.  Alternatively,  another  treed  GP 
model  can  be  formulated  in  which  the  children  of  a  node  would  learn  the 
residual  of  the  response  instead  of  the  response  itself.  In  such  a  way,  the 
upper  nodes  of  the  tree  would  model  coarse  features  of  the  response,  while 
localized  features  would  be  resolved  by  the  leaves  of  the  tree.  Finally,  a  great 
deal  of  effort  must  be  put  in  mathematically  working  out  the  error  bounds 
in  the  various  statistics  that  result  from  the  uncertainty  of  the  prediction. 
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As  already  mentioned  in  Section  2.2,  the  proper  Bayesian  way  to  account  for 
the  uncertainty  of  the  predicted  statistics,  would  be  via  an  MC  procedure: 
we  would  sample  a  complete  response  surface  from  the  full  model,  integrate 
it  with  respect  to  the  input  probability  distribution  and  obtain  a  sample  of 
the  statistics.  The  mathematical  details  of  such  a  procedure  are  the  subject 
of  our  current  research. 

Appendix  A.  Implementation  Details 

In  this  appendix,  we  discuss  several  details  with  regards  to  the  implemen¬ 
tation  of  the  UQ  framework  presented. 

The  nugget.  The  covariance  function  we  use  has  the  special  form: 

c(x<">,X("*>;0,<7)  =  c(xW,xW;e)  +g2Snm,  (A.l) 

where  c(-,-;0)  is  a  normal  covariance  function  depending  on  some  hyper- 
parameters  6,  g2  >  0  and  8nm  is  the  Kronecker  delta.  Such  a  covariance 
function  corresponds  to  the  case  where  f(x)  is  observed  with  additive  Gaus¬ 
sian  noise  with  zero  mean  and  variance  g2  (see  p.  16  of  [22]).  In  the  literature 
of  analysis  of  computer  experiments  using  GPs,  g2  is  known  as  the  nugget. 
Many  authors  (e.g.  [14],  [15]),  omit  the  nugget  on  the  grounds  that  com¬ 
puter  codes  are  deterministic.  Inclusion  of  the  nugget,  however,  has  been 
observed  to  enhance  numerical  stability  in  factorizing  the  covariance  ma¬ 
trix  [44,  17].  On  our  part,  we  have  observed  that  numerical  stability  is 
further  improved,  if  a  zero  mean,  g2  Gaussian  noise  is  added  to  the  scaled 
observed  responses  Eq.  (11).  The  effect  of  the  nugget  is  the  addition  of  a  g 2 
term  in  the  predictive  variance  of  the  scaled  responses.  A  typical  value  of 
the  nugget  we  use  in  the  numerical  examples  is  g2  =  10-6.  For  a  very  recent 
discussion  on  the  importance  of  the  nugget  in  computer  modeling,  see  [45]. 

Maximizing  the  marginal  likelihood.  In  this  work,  we  make  exclusive  use  of 
the  SE  covariance  function  defined  in  Eq.  (41).  Its  hyper-parameters  are 
the  signal  strength  Sf  >  0  and  the  length  scale  of  each  stochastic  input 
4  >  0.  Each  stochastic  element  is  associated  with  its  local  hyper-parameters 
which  are  found  by  maximizing  the  joint  marginal  likelihood  subject  to  the 
positivity  constraint.  In  order  to  achieve  this  in  practice,  we  maximize  with 
respect  to  the  logarithm  of  these  quantities,  i.e.  we  re-parameterize  the 
covariance  function  as: 

0\  =  log sf,  6k+ 1  =  log4. 
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This  results  in  an  equivalent  unconstrained  optimization  problem  which  we 
solve  using  a  Conjugate  Gradient  (CG)  method  [30],  i.e.  Eq.  (12)  with  ©  = 
RK+l.  It  is  important  to  notice  that  the  nugget,  g2,  is  not  optimized.  It 
remains  fixed  to  a  given  small  value.  Specifically,  we  used  the  Fletcher- 
Reeves  CG  algorithm  [46]  as  implemented  in  GSL  [40].  The  starting  values 
d0  =  (dito,  •  •  • ,  9k+i,o)  °f  the  optimization  algorithm  are  chosen  as  follows: 

1.  If  we  fit  a  GP  for  the  first  time  (i.e.  using  X  itself  as  the  first  element), 
we  set  0i,o  =  0  for  the  signal  parameter  and 


9k+ i,o  =  log 


for  the  length  scale  parameters,  where  —  a k  is  the  extent  of  X 

along  the  fc- dimension  (Eq.  (42)). 

2.  Otherwise,  if  X*  comes  from  splitting  in  half  a  parent  element,  we  set 
6 o  equal  to  the  hyper-parameters  of  the  parent  element. 


The  optimization  problem  does  not  necessarily  have  a  unique  maximum.  In 
reality,  different  local  maxima  are  associated  with  different  interpretations  of 
the  observed  data  set  (Ch.  5  of  [22]).  In  our  numerical  examples,  we  did  not 
encounter  any  problems  with  this  optimization  and  the  maxima  we  obtained 
were  quite  robust.  Powers  of  the  response  function  are  also  treated  as  MGPs 
with  SE  covariance  function,  albeit  having  their  own  hyper-parameters  6q 
(see  Section  2.2).  These  are  also  selected  by  maximizing  the  marginal  likeli¬ 
hood. 


Evaluation  of  the  integrals.  Finally,  we  come  to  the  problem  of  computing 
the  necessary  integrals  for  the  evaluation  of  the  statistics  (Eqs.  (24),  (25) 
and  (26)).  It  is  apparent  that  for  general  elements,  input  probability  dis¬ 
tribution  and  covariance  function,  these  integrals  have  to  be  numerically 
evaluated.  We  choose  to  work  with  square  elements,  uniform  input  probabil¬ 
ity  distribution  and  SE  covariance  function.  With  this  choice,  it  is  possible 
to  express  those  integrals  analytically  using  the  error  function: 

$(x)  =  -%=  fXe-t2dt.  (A.2) 

v7r  Jo 


In  particular,  let  X*  =  x 
X\  i.e. 


k=iiak^k\  and  Pl(z)  be  the  uniform  distribution  on 


pl(z) 


(A.3) 
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Then,  it  is  easy  to  show  that  (for  q  —  1): 


Lk 

(A.5) 


The  constant  c1  (Eq.  (25)),  can  be  trivially  shown  to  be 


The  integrals  that  pertain  to  the  higher  moments  q  >  1  are  obtained  similarly 
by  replacing  the  hyper-parameters  with  the  ones  that  correspond  to  the  MGP 
representing  the  response  raised  to  the  q  power  f9. 
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